1 Physical Design Cycle

This chapter discusses the physical design cycle for VLSI and FPGAs. The VLSI chip
design cycle includes the steps of system specification, functional design, logic design,
circuit design, physical design, fabrication and packaging. The physical design
automation of FPGAs involves three steps: partitioning, placement, and routing. A brief
introduction is followed by the VLSI and FPGA design cycles.

1.1 Introduction 

Despite advances in VLSI design automation, the time to market for a chip is
unacceptable for many applications. The key problem is the time taken to fabricate
chips, and therefore there is a need to find new technologies that minimize the
fabrication time. Gate arrays take less time to fabricate than full custom chips,
since only routing layers are fabricated on top of a pre-fabricated wafer. However,
fabrication time for gate arrays is still unacceptable for several applications. In order
to reduce the time to fabricate interconnects, programmable devices have been introduced
which allow users to program the devices as well as the interconnect.

The FPGA is a new approach to ASIC design that can dramatically reduce manufacturing
turnaround time and cost. In its simplest form an FPGA consists of a regular array of
programmable logic blocks interconnected by a programmable routing network. A
programmable logic block is a RAM and can be programmed by the user to act as a small
logic module. The key advantage of FPGAs is re-programmability.

In this section we first discuss the classical VLSI physical design cycle and then
introduce the physical design cycle for FPGAs.

The VLSI chip design cycle includes the steps of system specification, functional design,
logic design, circuit design, physical design, fabrication and packaging. We focus on the
physical design cycle. First we define the steps of partitioning, floor planning,
placement, routing and compaction that are included in physical design. Then we
explain the routing step in more detail, covering different approaches to routing
algorithms and their comparison.

The physical design automation of FPGAs involves three steps: partitioning, placement,
and routing. The partitioning problem in FPGAs is significantly different from the
partitioning problems in other design styles. This problem depends on the architecture
in which the circuit has to be implemented. The placement problem in FPGAs is very
similar to the gate array placement problem. The routing problem in FPGAs is to
find a connection path and program the appropriate interconnection points. We will
discuss the architecture of FPGAs and their physical design cycle.

1.2 Physical Design Cycle for VLSI

In the physical design step the circuit representation of each component is converted
into a geometric representation. This representation is a set of geometric patterns,
which perform the intended logic function of the corresponding component. Connections
between different components are also expressed as geometric patterns. Physical design
is a very complex process and therefore it is usually broken into several sub-steps.

The input to the physical design cycle is the circuit diagram and the output is the layout
of the circuit. This is accomplished in several stages such as partitioning, floor planning,
placement, routing and compaction. Each of these stages will be discussed in detail, but to
give an overview, a brief description of all stages is given here.

1.2.1 Partitioning 

A chip may contain millions of transistors. Layout of the entire circuit cannot be handled
at once due to the limitation of memory space as well as the computation power available.
Therefore it is normally partitioned by grouping the components into blocks. The actual
partitioning process considers many factors such as the size of the blocks, the number of
blocks, and the number of interconnections between the blocks. The set of interconnections
required is referred to as a net list. In large circuits the partitioning process is
hierarchical and at the topmost level a chip may have 5 to 25 blocks. Each block is then
partitioned recursively into smaller blocks.
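
One of the factors named above, the number of interconnections between blocks, can be
made concrete with a small Python sketch. The net list representation (each net as a
tuple of component names) and the names `cut_nets` and `block_of` are assumptions made
for illustration only; they do not come from any particular partitioning tool.

```python
def cut_nets(netlist, block_of):
    """Count the nets whose components are spread over more than one block.

    netlist  -- iterable of nets, each net a tuple of component names (assumed format)
    block_of -- dict mapping each component name to the block it was assigned to
    """
    return sum(1 for net in netlist if len({block_of[c] for c in net}) > 1)


# Hypothetical four-component circuit with three nets.
netlist = [("a", "b"), ("b", "c", "d"), ("c", "d")]

good_partition = {"a": 0, "b": 0, "c": 1, "d": 1}
poor_partition = {"a": 0, "b": 1, "c": 0, "d": 1}

good_cut = cut_nets(netlist, good_partition)
poor_cut = cut_nets(netlist, poor_partition)
```

With both blocks the same size, the first assignment cuts fewer nets than the second,
which is the kind of trade-off a partitioner weighs.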

1.2.2 Floor planning and Placement 

This step is concerned with selecting good layout alternatives for each block as well as
for the entire chip. The area of each block can be estimated after partitioning and is
based approximately on the number and type of components in that block. In addition, the
interconnect area required within the block must also be considered. Very often the task
of floor plan layout is done by a design engineer rather than a CAD tool, because a
human is better at visualizing the entire floor plan and taking into account the
information flow. In addition, certain components are often required to be located at
specific positions on the chip. During placement the blocks are exactly positioned on the
chip. The goal of placement is to find a minimum area arrangement for the blocks that
allows completion of interconnections between the blocks while meeting the performance
constraints. Placement is usually done in two phases. In the first phase an initial
placement is created. In the second phase the initial placement is evaluated and iterative
improvements are made until the layout has minimum area or best performance.

The quality of a placement will not be clear until the routing phase has been completed.
Placement may lead to an un-routable design. In that case another iteration of placement
is necessary. To limit the number of iterations of the placement algorithm, an estimate of
the required routing space is used during the placement process. Good routing and circuit
performance depend heavily on a good placement algorithm. This is due to the fact that
once the position of each block is fixed, there is not much that can be done to improve
the routing and the circuit performance.






1.2.3 Routing 

The objective of routing is to complete the interconnections between the blocks according
to the specified net list. First the space that is not occupied by the blocks (the routing
space) is partitioned into rectangular regions called channels and switchboxes. This
includes the space between the blocks. The goal of the router is to complete all circuit
connections using the shortest possible wire length and using only the channels and
switchboxes. This is usually done in two phases, referred to as the global routing and
detailed routing phases. In global routing, connections are completed between the proper
blocks disregarding the exact geometric details of each wire. For each wire the global
router finds a list of channels and switchboxes to be used as a passageway for that wire.
Detailed routing, which completes point-to-point connections, follows global routing.
Global routing is converted into exact routing by specifying geometric information such
as the location and spacing of wires. Routing is a very well studied problem. Since almost
all routing problems are computationally hard, researchers have focused on heuristic
algorithms.

1.2.4 Compaction 

Compaction is the task of compressing the layout in all directions such that the total area
is reduced. By making the chip smaller, wire lengths are reduced, which in turn reduces
the signal delay.

1.3 Global Routing 

This section discusses approaches to the global routing problem. Generally these
approaches are classified as sequential and concurrent approaches.

1.3.1 Sequential approach 

In this approach nets are routed one by one. Once a net is routed it may block other nets
which are yet to be routed. As a result this approach is very sensitive to the order in
which the nets are considered for routing. Usually the nets are ordered with respect to
their criticality. The criticality of a net is determined by the importance of the net. For
example, a clock net may determine the performance of the circuit, so it is considered
highly critical. However, sequencing techniques do not solve the net ordering problem
satisfactorily. An improvement phase is used to remove blockages when further routing
is not feasible. This may also not solve the net ordering problem, so in addition the
'rip-up and reroute' technique [Bol79, DK82] and 'shove-aside' techniques are used. In
rip-up and reroute the interfering wires are ripped up and rerouted to allow routing of
the affected nets, whereas in the shove-aside technique wires that block completion of
failed connections are moved aside without breaking the existing connections. Another
approach [De86] is to first route simple nets consisting of only two or three terminals,
since there are few choices for routing such nets. After the simple nets are routed, a
Steiner tree algorithm is used to route intermediate nets. Finally a maze routing
algorithm is used to route the remaining multi-terminal nets, which are not too numerous.

The sequential approach includes:

a) Two-terminal algorithms:

   i. Maze routing algorithms

   ii. Line-probe algorithms

   iii. Shortest path based algorithms

b) Multi-terminal algorithms:

   i. Steiner tree based algorithms

1.3.2 Concurrent approach 

This approach avoids the ordering problem by considering the routing of all nets
simultaneously. The problem is computationally hard and no efficient polynomial
algorithm is known even for two-terminal nets. As a result integer programming
methods have been suggested. The corresponding integer program is usually too large to
be solved efficiently. As a result hierarchical methods that work top down are used to
partition the problem into smaller sub-problems, which can be solved by integer
programming.

1.3.3 Maze Routing Algorithms 

Lee [Lee61] introduced an algorithm for routing a two-terminal net on a grid in 1961. This
algorithm and its variations form the class of maze routing algorithms. Maze routing
algorithms are used to find a path between a pair of points called the source (s) and
target (t) in a planar rectangular grid graph. The areas available for routing are
represented as unblocked vertices. The objective of a maze algorithm is to find a path
between the source and target vertex without using any blocked vertex. The process begins
with the exploration phase, in which several paths start at the source and are expanded
until one of them reaches the target. Once the target is reached the vertices need to be
retraced to the source to identify the path.

1.3.3.1 Lee's Algorithm:

The popularity of Lee's maze algorithm stems from its simplicity and its guarantee of
finding an optimal solution if one exists. The exploration phase is an improved version of
breadth-first search. The search is like a wave propagating from the source. The source is
labeled 0 and the wave propagates to all unblocked vertices adjacent to the source. Every
unblocked vertex adjacent to the source is marked with label 1. Then every unblocked
vertex adjacent to the vertices with label 1 is marked with label 2, and so on. This
process continues until the target is reached or no further expansion is possible. The
path found is guaranteed to be the shortest path as well. This algorithm requires a large
amount of storage space and its performance degrades rapidly as the size of the grid
increases.
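
The wave expansion and retrace phases described above can be sketched in Python as
follows. This is a minimal illustration, not the formulation from [Lee61]: the grid
encoding (`True` marks a blocked vertex) and the function name `lee_route` are
assumptions chosen for the example.

```python
from collections import deque


def lee_route(grid, source, target):
    """Route one two-terminal net on a grid by wave expansion and retrace.

    grid[r][c] is True for a blocked vertex; source and target are (row, col).
    Returns the list of vertices from source to target, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    label = {source: 0}          # wave-front number of every vertex reached so far
    frontier = deque([source])
    # Exploration phase: breadth-first wave from the source.
    while frontier and target not in label:
        r, c = frontier.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and not grid[nr][nc] and (nr, nc) not in label):
                label[(nr, nc)] = label[(r, c)] + 1
                frontier.append((nr, nc))
    if target not in label:
        return None
    # Retrace phase: walk from the target to a neighbour whose label is one less.
    path = [target]
    while path[-1] != source:
        r, c = path[-1]
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if label.get(nb) == label[(r, c)] - 1:
                path.append(nb)
                break
    path.reverse()
    return path


# Hypothetical 3 x 4 routing area with a two-cell obstacle in the middle row.
grid = [
    [False, False, False, False],
    [False, True,  True,  False],
    [False, False, False, False],
]
path = lee_route(grid, (0, 0), (2, 3))
```

The `label` dictionary holds one entry per reached vertex, which is the storage cost the
text refers to: it grows with the area swept by the wave, not with the length of the path.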

1.3.3.2 Soukup's Algorithm:

Lee's algorithm explores the grid symmetrically, searching equally in the directions away
from the target, and so requires a large search time. Soukup [Sou78] proposed an iterative
algorithm to overcome this. During each iteration the algorithm explores in the direction
toward the target, without changing direction, until it either reaches the target or is
forced to move away from the target. If the target is reached the exploration phase ends.
If the target is not reached the search is conducted iteratively. If the search goes away
from the target, the algorithm changes direction and a new iteration begins. This
algorithm improves on the speed of Lee's algorithm by a factor of 10-50. It guarantees
finding a path if one exists; however, it may not be the shortest path.

1.3.3.3 Hadlock's Algorithm:

An alternative approach to improve the speed was proposed by Hadlock [Had75]. The
algorithm is called the minimum detour algorithm. According to Hadlock, the length of a
path P connecting a source and target is given by M(s, t) + 2d(P), where M(s, t)
is the Manhattan distance between source and target and d(P) is the number of vertices on
path P that are directed away from the target. The length of P is minimized if d(P) is
minimized, since M(s, t) is constant for any source-target pair. The exploration phase
labels vertices with the detour number instead of the wave front number. The detour number
of a path is the number of times that the path has turned away from the target.
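
The relation between path length and detour number can be checked with a short Python
sketch. It is one possible realization, not Hadlock's original formulation: a
double-ended queue gives moves toward the target cost 0 and moves away from it cost 1, so
vertices are expanded in order of increasing detour number. The function name
`hadlock_detours` and the grid encoding (`True` marks a blocked vertex) are assumptions
for illustration.

```python
from collections import deque


def hadlock_detours(grid, source, target):
    """Return the minimum detour number d(P) over all paths, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def manhattan(v):
        return abs(v[0] - target[0]) + abs(v[1] - target[1])

    detour = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            return detour[v]
        r, c = v
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nb
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc]:
                continue
            # A step that does not reduce the Manhattan distance is a detour.
            step = 0 if manhattan(nb) < manhattan(v) else 1
            d = detour[v] + step
            if d < detour.get(nb, float("inf")):
                detour[nb] = d
                # Detour-free moves go to the front so they are expanded first.
                if step == 0:
                    queue.appendleft(nb)
                else:
                    queue.append(nb)
    return None


# Hypothetical 3 x 3 routing area; a wall forces the route around the right side.
grid = [
    [False, False, False],
    [True,  True,  False],
    [False, False, False],
]
source, target = (2, 0), (0, 0)
d = hadlock_detours(grid, source, target)
m = abs(source[0] - target[0]) + abs(source[1] - target[1])
route_length = m + 2 * d     # M(s, t) + 2 d(P)
```

In this example the Manhattan distance is 2 and two steps lead away from the target, so
the formula gives a route of length 6, the same length a breadth-first wave would find.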

1.3.3.4 Comparison of Maze Routing Algorithms

These algorithms are grid-based methods. The time and space required by these
algorithms depend linearly on their search space. Lee's algorithm guarantees finding a
shortest path between any two vertices if a path exists; however, its worst case occurs
when the source is located at the center and the target is located at a corner of the
routing area. In this case all vertices have to be scanned before the target is reached.
Soukup's algorithm addresses this shortcoming of breadth-first search by using depth-first
search until an obstacle is encountered. When an obstacle is reached, breadth-first search
is used to get around it. Search time is smaller owing to the nature of depth-first
search; however, a shortest path is not guaranteed. The worst case occurs when the search
goes in the direction of the target, which is the direction opposite to the passageway
through the obstacle. Hadlock's algorithm is a breadth-first search method. The difference
from Lee's algorithm is the way the wave front is labeled. In Hadlock's algorithm the
search can prefer the direction toward the target to the direction away from the target.
Search time is shorter than for Lee's algorithm. The worst case occurs when the search
goes toward the target and opposite to the passageway through the obstacle.

All maze routers are grid-based methods. Information must be kept for each grid node, so
a large memory space is needed for a large grid. As an approximate estimate, a chip of
size 10000λ × 10000λ requires as much as 350 Mbytes of memory and 66 seconds to route
one net on a 15 MIPS workstation. There may be 5000 to 10000 nets in a typical chip. In
order to reduce the large memory requirements and run times, line-probe algorithms were
developed.

1.3.4 Line-Probe Algorithms 

These algorithms were developed by Mikami and Tabuchi [MT68] and Hightower [Hig69].
A line-probe algorithm reduces the memory requirement by using line segments
instead of grid nodes in the search. The time and space requirements are of the order
O(L), where L is the number of line segments produced by the algorithm.

Initially the lists 'slist' and 'tlist' contain the line segments generated from the source
and target respectively. The generated line segments do not pass through any obstacle.
During each iteration new line segments are generated and appended to the corresponding
list. If a line segment from slist intersects a line segment from tlist then the
exploration ends; otherwise exploration continues iteratively. The path is formed by
retracing the line segments in tlist, starting from the target and going through the
intersection, and then retracing the line segments in slist until the source is reached.
In Mikami's algorithm every grid node on a line segment is an escape point from which new
perpendicular line segments are generated. Hightower's algorithm uses only one escape
point on a line segment. The disadvantage is that it may not be able to find a path
joining two points.

1.3.5 Steiner Tree Based Algorithms

The routing algorithms explained so far are not suitable for routing multi-terminal nets.
Different techniques have been proposed that modify maze routing and line-probe
algorithms. In one approach multi-terminal nets are decomposed into several two-terminal
nets and those two-terminal nets are routed. The quality of the routing in this approach
depends on the decomposition of the nets, which might result in sub-optimal routes. The
natural approach for multi-terminal routing is the Rectilinear Steiner Tree (RST). An
RST is a tree with rectilinear edges. The length of the tree is the sum of the lengths of
all edges in the tree.
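
To illustrate why decomposition into two-terminal nets can be sub-optimal, the following
Python sketch decomposes a net using a minimum spanning tree under Manhattan distance.
The spanning-tree decomposition is only one possible choice, assumed here for
illustration; the text does not prescribe it, and the names `manhattan` and
`decompose_net` are hypothetical.

```python
def manhattan(p, q):
    """Rectilinear distance between two terminals given as (x, y) pairs."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def decompose_net(terminals):
    """Split a multi-terminal net into two-terminal nets with Prim's algorithm.

    Returns (edges, total_length), where each edge is a pair of terminals.
    """
    in_tree = [terminals[0]]
    remaining = list(terminals[1:])
    edges, total = [], 0
    while remaining:
        # Cheapest connection from the partial tree to a terminal not yet in it.
        a, b = min(((u, v) for u in in_tree for v in remaining),
                   key=lambda e: manhattan(*e))
        edges.append((a, b))
        total += manhattan(a, b)
        in_tree.append(b)
        remaining.remove(b)
    return edges, total


# Hypothetical three-terminal net.
terminals = [(0, 0), (4, 0), (2, 3)]
edges, mst_length = decompose_net(terminals)

# A rectilinear Steiner tree may add an extra junction; using (2, 0) here:
steiner_point = (2, 0)
rst_length = sum(manhattan(steiner_point, t) for t in terminals)
```

For this net the spanning-tree decomposition has total length 9, while the tree through
the extra junction at (2, 0) has length 7, which is the gap the text alludes to.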

1.4 Physical Design Automation of FPGAs

Despite advances in VLSI design automation, the time to market for a chip is
unacceptable for many applications. The key problem is the time taken to fabricate
chips, and therefore there is a need to find new technologies that minimize the
fabrication time. Gate arrays take less time to fabricate than full custom chips,
since only routing layers are fabricated on top of a pre-fabricated wafer. However,
fabrication time for gate arrays is still unacceptable for several applications. In order
to reduce the time to fabricate interconnects, programmable devices have been introduced
which allow users to program the devices as well as the interconnect.

The FPGA is a new approach to ASIC design that can dramatically reduce manufacturing
turnaround time and cost. In its simplest form an FPGA consists of a regular array of
programmable logic blocks interconnected by a programmable routing network. A
programmable logic block is a RAM and can be programmed by the user to act as a small
logic module. The key advantage of FPGAs is re-programmability.

The physical design automation of FPGAs involves three steps: partitioning, placement,
and routing. The partitioning problem in FPGAs is significantly different from the
partitioning problems in other design styles. This problem depends on the architecture
in which the circuit has to be implemented. The placement problem in FPGAs is very
similar to the gate array placement problem. The routing problem in FPGAs is to
find a connection path and program the appropriate interconnection points. We will
discuss the architecture of FPGAs and their physical design cycle.

1.4.1 FPGA Technologies 

FPGA architecture mainly includes two parts: the logic blocks and the routing network.
A logic block has a fixed number of inputs and one output. A wide range of functions can
be implemented using a logic block. Given a circuit to be implemented using FPGAs, it is
first decomposed into smaller sub-circuits such that each sub-circuit can be
implemented using a single logic block. There are two types of logic blocks. The first
type is based on look-up tables while the second is based on multiplexers.

1.4.1.1 Look-up table based logic blocks: 

A lookup table based logic block is just a segment of RAM. A function can be
implemented by simply loading its lookup table into the logic block at power up. If the
function F = A'BC + AB'C' needs to be implemented, then its truth table, shown below, is
loaded into the logic block.

[Truth table of F: inputs A, B, C; output F]



In this way, on receiving a certain set of inputs the logic block simply looks up the
appropriate output and sets the output line accordingly. Because of their reconfigurable
nature, lookup table based logic blocks are also called Configurable Logic
Blocks (CLBs).
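
The lookup mechanism can be sketched in a few lines of Python. The example assumes the
function is F = A'BC + AB'C' (a prime denotes complement); this reading of the example
function is an assumption, and the names `build_lut` and `lut_read` are hypothetical.

```python
from itertools import product


def build_lut(func, n_inputs):
    """Tabulate func over all input combinations, as would be loaded at power up."""
    return [func(*bits) for bits in product((0, 1), repeat=n_inputs)]


def lut_read(lut, *bits):
    """Look up the output; the input bits form the RAM address, first input as MSB."""
    address = 0
    for b in bits:
        address = (address << 1) | b
    return lut[address]


def f(a, b, c):
    # Assumed example function: F = A'BC + AB'C'
    return int(((not a) and b and c) or (a and (not b) and (not c)))


lut = build_lut(f, 3)

# Print the truth table row by row: A B C F
for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, lut_read(lut, a, b, c))
```

Changing the implemented function means loading a different table; the lookup logic
itself stays the same, which is what makes the block configurable.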

1.4.1.2 Multiplexer Based Logic Blocks: 

Typically a multiplexer based logic block consists of three 2-to-1 multiplexers and one
two-input OR gate.
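
A Python sketch may help show how such a block realizes different functions purely
through the way its inputs are wired. The internal wiring assumed here (two first-level
multiplexers feeding a third whose select line is driven by the OR gate) is one common
arrangement and is an assumption; the text only states the component count. All names
are hypothetical.

```python
def mux2(d0, d1, sel):
    """2-to-1 multiplexer: passes d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def logic_block(a0, a1, sa, b0, b1, sb, s0, s1):
    """Assumed wiring: the OR of s0 and s1 selects between two first-level muxes."""
    return mux2(mux2(a0, a1, sa), mux2(b0, b1, sb), s0 | s1)


# Different functions are obtained only by tying inputs to signals or constants.
def and2(x, y):
    return logic_block(0, 0, 0, 0, 1, x, y, 0)


def or2(x, y):
    return logic_block(0, 1, x, 1, 1, 0, y, 0)


def xor2(x, y):
    return logic_block(0, 1, x, 1, 0, x, y, 0)
```

In each case the block itself is unchanged; only the connections made by the routing
network differ.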



Figure 1 Logic Block



The circuit within the block can be used to implement a wide range of functions.
Programming of a multiplexer based logic block is achieved by routing different inputs
into the block. There are two different models of the routing network: the segmented and
the non-segmented.

1.4.1.2.1 Non-Segmented model 

The non-segmented model is set up as a regular grid of five horizontal and five vertical
metal lines passing between switch blocks (S). Switch blocks are used to connect the
wiring segments in one channel segment to those in another. Depending on the topology
of the S block, each wiring segment on one side of S may be switchable to either all or
some fraction of the wiring segments on each other side of the S block. The fewer the
wiring segments to which a wiring segment can be switched, the harder the FPGA is to
route. In addition to the S blocks there are connection blocks (C) that are used to
connect the logic block (L) pins to the routing channels. Depending on the topology, each
L block pin may be switchable to either all or some fraction of the wiring segments that
pass through the connection block.

1.4.1.2.2 Segmented model: 

The tracks in the channels contain predefined wiring segments of the same or different
lengths. Other wiring segments pass through the channels vertically. Each input and
output of a logic block is connected to a dedicated vertical segment. As a result there
are no vertical constraints. There are additional global vertical lines that provide
connections between different channels. The connection between two horizontal segments is
provided through an anti-fuse, whereas the connection between a horizontal segment and a
vertical segment is provided through a cross fuse. Programming one of these fuses provides
a low resistance bi-directional connection between two segments. When blown, anti-fuses
connect the two segments to form a longer one.

The segmented model is uniform if the segments in all tracks have the same length and
the anti-fuses in different tracks in a channel are aligned in columns. The segmented
model has an advantage over the non-segmented model in terms of utilization of routing
resources. In the non-segmented model only one segment of one net can be routed on a
track. In the segmented model, the segments of several nets can be assigned to a track as
long as no two net segments are assigned to the same track segment. The total number of
programmable switches in the segmented model is higher than the number of
switches in the non-segmented model. The delay of a net is proportional to the number of
programmable switches used. In the segmented model this number is higher. As a result
the non-segmented model is preferred over the segmented model when performance is the
primary objective.

1.5 Physical Design Cycle for FPGAs

1.5.1 Partitioning 

In this step the circuit to be mapped onto the FPGA has to be partitioned into smaller
sub-circuits such that each sub-circuit can be mapped to a programmable logic block. Unlike
partitioning in other design styles, there are no constraints on the size of a partition.
However, there are constraints on the inputs and outputs of a partition. This is due to
the unique architecture of FPGAs.

1.5.2 Placement 

In this step of the design cycle, the sub-circuits formed in the partitioning
phase are allocated physical locations on the FPGA, i.e., each logic block on the FPGA is
programmed to behave like the sub-circuit that is mapped to it. This placement must be
carried out in such a manner that the routers can complete the interconnections. This is
very critical as the routing resources of the FPGA are limited. The placement algorithms
for general gate arrays are normally used for placement in FPGAs.

1.5.3 Routing 

In this phase all the sub-circuits that have been programmed on the FPGA blocks are
interconnected by blowing fuses between routing segments to achieve the
interconnections.

1.6 Routing In FPGA 

In this section we discuss FPGA routing for different models. 

1.6.1 Routing Algorithm for the Non-Segmented Model

In this section we discuss the algorithm presented by Brown [BRV92]. The routing is 
completed in two steps. 

1.6.1.1 Global Routing 

Global routing in FPGAs can be done using a global router for standard cell designs. In
general such a global router divides the multi-terminal nets into two-terminal nets and
routes them along minimum distance paths. While doing so it also tries to balance the
densities by distributing the connections among the channels. The global router defines a
coarse route for each connection by assigning it a sequence of channel segments. Figure
2.1a shows a sequence of channel segments that a global router might choose to connect
some pin of the logic block at grid location 4,1 to another at 0,1. The global route is
also called a coarse grid graph. The coarse grid graph gives a path between two L nodes
through a sequence of S and C nodes.

Figure 2.1a Coarse grid graph (axes: block grid coordinates)

1.6.1.2 Detailed Routing

Given a coarse grid graph G = (V, E) for a two-terminal net, the objective of the detailed
router is to choose specific wiring segments in each channel segment assigned during
global routing. This is achieved in two steps:

1.6.1.2.1 Expansion of coarse grid graph

In this step the coarse grid graph is expanded to record a subset of the possible ways of
implementing the connection. The expansion is carried out while traversing the graph in
depth-first search order. Figure 2.1b shows the expanded graph.

Algorithm Graph-Expansion (G)
{
    Gd = G;
    while (DFS_COMPLETE (Gd) == FALSE)
    {
        Vi = CURRENT_DFS_VISIT (Gd);   /* node currently visited in DFS order */
        li = WIRE_SEGMENT (Vi, Gd);    /* segment connecting Vi to its predecessor */
        if (NODE_TYPE (Vi) == C || NODE_TYPE (Vi) == S)
        {
            Vj = SUCCESSOR (Gd, Vi);
            Tj = SUBTREE (Gd, Vj);
            if (NODE_TYPE (Vi) == C)
            {
                for (each wiring segment l in Fc (Vi, Vj, li))
                {
                    T = DUPLICATE (Tj);
                    CONNECT (Vi, Vj, T, l);
                }
                DELETE (Gd, Tj);
            }
            if (NODE_TYPE (Vi) == S)
            {
                for (each wiring segment l in Fs (Vi, Vj, li, Vk))
                {
                    T = DUPLICATE (Tj);
                    CONNECT (Vi, Vj, T, l);
                }
                DELETE (Gd, Tj);
            }
        }
    } /* End of while */
}

The description of the algorithm is as follows: 

Function DFS_COMPLETE (G) returns true if all the nodes in G have been visited during
depth-first search. Function CURRENT_DFS_VISIT (Gd) returns the node being visited during
DFS. Function WIRE_SEGMENT (Vi, Gd) returns the wire segment that connects Vi to its
predecessor. Function SUBTREE (Gd, Vj) returns the sub-tree of Gd rooted at Vj.
Function NODE_TYPE (Vi) returns the block type (C, S or L). If Vj is a C node, Vk is
the successor of Vj, and a wire segment l is used to connect the two, then the function Fc
(Vi, Vj, l) returns the set of wiring segments that can be used to connect Vj to Vk.
Similarly, if Vj is an S node, Vk is its successor, and a wire segment l is used to
connect the two, then the function Fs (Vi, Vj, l, Vk) returns the set of wiring segments
that can be used to connect Vj to Vk. Function DUPLICATE (T) returns a copy of tree T.
CONNECT (Vi, Vj, T, l) connects the nodes Vi and Vj in T by a directed edge from Vi to Vj
and labels the connecting edge l. DELETE (G, T) deletes the sub-tree T from G.

1.6.1.2.2 Connection Formation

The expanded graph Gd = (Vd, Ed) contains a number of alternative paths. In this step all
these paths are enumerated, their costs are computed, and the minimum cost path is selected
to implement the connection. The cost of a path is the sum of the costs of the edges on
that path. The cost of an edge consists of two parts: Cf(l) accounts for the competition
between different nets for the same wiring segments and Ct(l) reflects the routing delay
associated with the routing segment.
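
The enumeration and selection step can be sketched in Python as follows. The tree
representation (a dictionary from each node to its `(child, segment)` pairs), the cost
tables `cf` and `ct`, and all node and segment names are assumptions made for this
example; they are not taken from [BRV92].

```python
def enumerate_paths(tree, node, prefix=()):
    """Yield every root-to-leaf path as a tuple of wiring-segment labels."""
    children = tree.get(node, [])
    if not children:
        yield prefix
        return
    for child, segment in children:
        yield from enumerate_paths(tree, child, prefix + (segment,))


def path_cost(path, cf, ct):
    """Sum of edge costs, each edge costing Cf(l) + Ct(l)."""
    return sum(cf[l] + ct[l] for l in path)


def cheapest_path(tree, root, cf, ct):
    """Select the minimum-cost alternative from the expanded graph."""
    return min(enumerate_paths(tree, root), key=lambda p: path_cost(p, cf, ct))


# Hypothetical expanded graph with two alternative implementations of one connection.
tree = {
    "L_src": [("C_a", "t1"), ("C_b", "t2")],
    "C_a": [("L_dst_a", "t3")],
    "C_b": [("L_dst_b", "t4")],
}
cf = {"t1": 2, "t2": 0, "t3": 1, "t4": 1}   # competition for each segment (made up)
ct = {"t1": 1, "t2": 1, "t3": 1, "t4": 2}   # delay of each segment (made up)

best = cheapest_path(tree, "L_src", cf, ct)
```

Here the alternative through t1 and t3 costs 5 and the one through t2 and t4 costs 4, so
the second is chosen even though its last segment is slower.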

1.6.2 Routing Algorithms for the Segmented Model

1.6.2.1 Basic Algorithm 

Green, Kaptanoglu, and Gamal have presented an algorithm for routing in the segmented
model [GRKG93]. The input to the routing problem is a set of intervals I = {I1, I2, ..., In}
and a set of tracks T = {T1, T2, ..., Tm}. Each track Ti in T extends from column 1 to
column N and is divided into a set of contiguous segments separated by switches. The
switches are placed between two successive segments. For each interval Ii in I, LEFT(Ii)
and RIGHT(Ii) are defined to be the leftmost and rightmost columns in which the interval
is present. The intervals in I are assumed to be sorted on their left edges. If an
interval Ii is assigned to track Tj, then the segments in track Tj that are present in
the columns spanned by the interval are considered occupied. A routing of I consists of an
assignment of each interval Ii in I to a track such that no segment is occupied by more
than one connection. The routing in the segmented model can be achieved using the
algorithm SEG_ROUTER() shown below.



Algorithm SEG_ROUTER (I, T, A)
{
    Input: I, T; Output: A
    for (i = 1; i <= n; i++)
        for (j = 1; j <= m; j++)
        {
            s = GET_SEGMENT (j, LEFT (Ii));
            if (OCCUPIED (s) == FALSE)
            {
                A[i] = j;
                MARK_OCCUPIED (Ii, Tj);
                break;   /* interval routed; continue with the next interval */
            }
        }
}



The SEG_ROUTER algorithm is a modified left edge algorithm. The input to the algorithm
is the set of intervals I and the set of tracks T, whereas the output is an array A, where
A[i] gives the number of the track in T on which the interval Ii is routed. In the
algorithm, GET_SEGMENT (j, C) returns a segment s on track Tj such that column C is in the
span of s. Function OCCUPIED (s) returns true if the segment s is occupied.
MARK_OCCUPIED (Ii, Tj) marks all the segments on track Tj that are occupied by Ii.
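
A runnable Python sketch of this procedure is given below. It is an illustration under
stated assumptions rather than the published algorithm: intervals are `(left, right)`
column pairs already sorted by left edge, each track is a list of `(start, end)` segments
covering all columns, indices are 0-based, and an interval that fits on no track is left
as `None`. The names `seg_router` and `segment_at` are hypothetical.

```python
def seg_router(intervals, tracks):
    """Assign each interval to the first track whose segment under LEFT is free.

    Returns a list A with A[i] the 0-based track index of interval i, or None.
    """
    occupied = [[False] * len(track) for track in tracks]

    def segment_at(j, column):
        # Index of the segment on track j whose span contains the column.
        for k, (start, end) in enumerate(tracks[j]):
            if start <= column <= end:
                return k
        raise ValueError("column lies outside the track")

    assignment = [None] * len(intervals)
    for i, (left, right) in enumerate(intervals):
        for j in range(len(tracks)):
            # With intervals sorted by left edge, a free segment under LEFT implies
            # that every segment the interval spans on this track is free as well.
            if not occupied[j][segment_at(j, left)]:
                assignment[i] = j
                for k, (start, end) in enumerate(tracks[j]):
                    if start <= right and end >= left:
                        occupied[j][k] = True
                break
    return assignment


# Hypothetical channel with 8 columns and two differently segmented tracks.
tracks = [
    [(1, 4), (5, 8)],
    [(1, 2), (3, 6), (7, 8)],
]
intervals = [(1, 3), (2, 5), (5, 8), (7, 8)]
A = seg_router(intervals, tracks)
```

Two intervals share each track in this example because they fall on different segments,
which is the resource-sharing advantage of the segmented model.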

1.6.2.2 Routing Algorithm for Staggered Model 

The segmented model can be improved in several ways. In this model each channel is
partitioned into several regions. Each region is characterized by the length of its
segments. The tracks in each region have equal length segments separated by staggered
placement of the switches. There are three parameters with respect to the new model: the
number of regions (p), the number of tracks (t), and the length of the segments in each
region (l). Determination of these three parameters is an important step in this model.
These parameters can be determined by a detailed empirical analysis on several standard
benchmarks [BKS92]. If the length of the segments in all the regions is the same then the
model is called the uniform staggered model; otherwise it is the non-uniform staggered
model.

Algorithm SEG_ROUTER can be used for routing in the staggered models. In a uniform
staggered model the delay of a net is the same irrespective of the routing track, whereas
in a non-uniform staggered model the delay of a net depends on the routing track, since
the number of anti-fuses in the path of a net may differ from track to track. The
algorithm SEG_ROUTER is not suitable for high performance routing, as it is not sufficient
to minimize delay based on the anti-fuse elements alone; capacitance effects due to the
unused portion of the segments spanned by a net segment and due to the unprogrammed
switches must also be considered.

The algorithm starts by routing the longest net first. This ensures that the delay due to
the longest net is minimized, which is a prerequisite for a high performance routing
system. For each net it finds a track on which the net can be routed with minimum delay.
The original algorithm has three phases: region selection, track selection and region
reselection. In algorithm FSCR shown below, the function OK_TO_ASSIGN (Ii, Tj)
returns true if all the segments spanned by the interval Ii on track Tj are unoccupied.
Function COMPUTE_DELAY (Ii, Tj) computes the delay of the interval Ii if Ii is routed
on track Tj. Function MARK_OCCUPIED (Ii, Tj) is the same as that used in
algorithm SEG_ROUTER.
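
The delay-driven track selection just described can be sketched in Python. The delay
model used here is purely an assumption for illustration: one unit per anti-fuse crossed
plus a small capacitance penalty per unused column of the spanned segments. The source
does not specify COMPUTE_DELAY, and the names `fscr`, `fuse_delay` and `cap_per_column`
are hypothetical. Tracks and intervals are represented as `(start, end)` column pairs with
0-based indices.

```python
def fscr(intervals, tracks, fuse_delay=1.0, cap_per_column=0.1):
    """Route the longest intervals first, each on its minimum-delay free track.

    Returns a dict mapping interval index to track index, or None if some
    interval cannot be routed.
    """
    occupied = [[False] * len(track) for track in tracks]

    def spanned(j, left, right):
        # Indices of the segments on track j that overlap the interval.
        return [k for k, (s, e) in enumerate(tracks[j]) if s <= right and e >= left]

    def delay(j, left, right):
        # Assumed model: anti-fuses joined plus unused length of spanned segments.
        segs = spanned(j, left, right)
        covered = sum(tracks[j][k][1] - tracks[j][k][0] + 1 for k in segs)
        unused = covered - (right - left + 1)
        return fuse_delay * (len(segs) - 1) + cap_per_column * unused

    order = sorted(range(len(intervals)),
                   key=lambda i: intervals[i][1] - intervals[i][0], reverse=True)
    assignment = {}
    for i in order:
        left, right = intervals[i]
        best = None
        for j in range(len(tracks)):
            if any(occupied[j][k] for k in spanned(j, left, right)):
                continue                      # OK_TO_ASSIGN is false for this track
            d = delay(j, left, right)
            if best is None or d < best[0]:
                best = (d, j)
        if best is None:
            return None                       # routing not possible
        assignment[i] = best[1]
        for k in spanned(best[1], left, right):
            occupied[best[1]][k] = True
    return assignment


# Hypothetical channel: the same two tracks, segmented differently.
tracks = [
    [(1, 4), (5, 8)],
    [(1, 2), (3, 6), (7, 8)],
]
intervals = [(1, 4), (3, 6), (7, 8)]
result = fscr(intervals, tracks)
```

Unlike the left edge sketch, the short interval (7, 8) is placed on the track whose
segment fits it exactly, because the longer segment on the other track would add unused
capacitance under the assumed delay model.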

Algorithm FSCR (I, T, A)
{
    Input: I, T
    Output: A
    for (i = 1; i <= n; i++)
    {
        selected_track = 0;
        minimum_delay = INFINITY;
        for (j = 1; j <= m; j++)
        {
            if (OK_TO_ASSIGN (Ii, Tj) == TRUE)
            {
                current_delay = COMPUTE_DELAY (Ii, Tj);
                if (minimum_delay > current_delay)
                {
                    minimum_delay = current_delay;
                    selected_track = j;
                }
            }
        }
        if (selected_track != 0)
        {
            A[i] = selected_track;
            MARK_OCCUPIED (Ii, T[selected_track]);
        }
        else
            exit; /* routing not possible */
    }
}
1.7 Conclusion

With emerging multimedia applications incorporating a wider flexibility of operations,
there is an exponential increase in the demand for processing power. The complexity,
variety of techniques and tools, and the high computation, storage and I/O bandwidths
associated with multimedia processing pose several challenges, particularly from the
points of scalability, resource utilization (in terms of area and energy) and real-time
implementation. A number of implementation strategies have been proposed for
processing multimedia data. These approaches can be broadly classified into general
purpose and special purpose. Although general purpose processors with multimedia
extended instruction sets pack sufficient power to handle media applications, their
power consumption, ranging from 12 watts to 74 watts, does not make them feasible for
mobile computing. These processors employ mechanisms such as out-of-order issue and
dynamic branch prediction that are costly in terms of area and power. Hence there arose a
need for new high performance processors for multimedia applications. The special
purpose programmable processors exploit the redundancies involved in media processing
algorithms through the use of multiple floating point and media specific execution units
by adding the flavors of VLIW and SIMD onto the processing core. In the programmable
scenario, a data path that is already on chip can execute only a specific set of
operations. Any algorithm that does not have those specific operations needs to be
interpreted in terms of those instructions. This consumes time and power. Special purpose
programmable processors have higher performance with less control circuit complexity.
They also have features to support operations on parallel-packed data, which exploits
parallelism to a large extent in media processing. However, the design and debugging
phases involved in designing such application specific programmable processors tend to
drive the development cost higher. MPEG-4 and JPEG 2000 offer high interactivity to the
user, which translates to a dynamic change in the computing resources at both the
encoder and decoder units. Current technologies fall short of providing low cost and
flexible solutions for multimedia processing. These drawbacks have led to exploration
of the reconfigurable architecture design space. The reconfigurable computing devices
should be able to adapt the underlying hardware dynamically in response to changes in
the input data or processing environment.
FPGAs are being used as a new approach to ASIC design, which offers a dramatic
reduction in manufacturing turnaround time and cost. The physical design cycle of an
FPGA consists of three steps: partitioning, placement and routing. The FPGA partitioning
problem is different from the conventional area-partitioning problem in the sense that it
depends on the architecture in which the circuit has to be implemented. The placement
problem is equivalent to the general gate array placement problem. However, because of
the segmented nature of the FPGA channels, the routing considerations are quite different.
In high performance FPGA designs the number of anti-fuse elements, along with the
unused tracks and anti-fuses, must be given due consideration as part of the routing
phase. To match the needs of future multimedia applications, we have proposed the
first of a series of tools intended to help in the design and development of a dynamically
reconfigurable multimedia processor.

2 Toolset

2.1 Profiler, Partitioner, Placement and Routing

Our research effort aids the design of a dynamically reconfigurable processor through the
use of a set of analysis and design tools. These are intended to help hardware and system
designers arrive at optimal hardware-software co-designs for applications of a given
class. The reconfigurable computing devices thus designed will be able to adapt the
underlying hardware dynamically in response to changes in the input data or processing
environment. The methodology for designing a reconfigurable media processor involves
hardware-software co-design based on a set of three analysis and design tools [AK02].
The first tool handles cluster recognition, extraction and a probabilistic model for ranking
the clusters. The second tool provides placement rules and a feasible routing architecture.
The third tool provides rules for data path, control units and memory design based on the
clusters and their interaction. With the use of all three tools, it will be possible to design
media processors that can dynamically adapt at both the hardware and software levels in
embedded applications. The input to the first tool is a compiled version of the application
source code. Regions of the data flow graph obtained from the source code that are
devoid of branch conditions are identified as zones. Clusters are identified in the zones
by representing candidate instructions as data points in a multidimensional vector space.
Properties of an instruction, such as location in a sequence, number of memory accesses,
floating or fixed-point computation, etc., constitute the various dimensions. As shown in
Figure-3, clusters obtained from the previous tool are placed and routed by Tool #2
according to spatial and temporal constraints (Figure 4). The processor (of the compiler)
can be any general purpose embedded computing core such as an ARM core or a MIPS
processor. These are RISC cores and hence are similar to general purpose machines such
as UltraSPARC. The output of the tool is a library of clusters and their interaction. (A
cluster comprises sequential but not necessarily contiguous assembly level
instructions.) The clusters represent those groups or patterns of instructions that occur
frequently and hence qualify for hardware implementation. To maximize the use of
reconfigurability amongst clusters, possible parallelism and speculative execution
possibilities must be exploited. The rest of this chapter explains the steps included in the
partitioner and profiler tool.
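
To make the notion of a zone concrete, here is a small Python sketch that splits a linear instruction listing into branch-free regions. It is only an illustration: the opcode set `BRANCH_OPCODES`, the textual instruction format and the function name are hypothetical, and a real tool working on a data flow graph would also have to split at branch targets.

```python
from typing import List

# Hypothetical set of branch mnemonics; a real ISA defines its own.
BRANCH_OPCODES = {"b", "beq", "bne", "bl", "jmp"}


def split_into_zones(instructions: List[str]) -> List[List[str]]:
    """Return maximal runs of consecutive non-branch instructions."""
    zones: List[List[str]] = []
    current: List[str] = []
    for ins in instructions:
        opcode = ins.split()[0].lower()
        if opcode in BRANCH_OPCODES:
            # A branch closes the current zone and is not part of any zone.
            if current:
                zones.append(current)
            current = []
        else:
            current.append(ins)
    if current:
        zones.append(current)
    return zones


program = [
    "add r1, r2, r3",
    "mul r4, r1, r1",
    "beq r4, r0, L1",
    "ld r5, 0(r1)",
    "sub r6, r5, r4",
    "jmp L2",
    "st r6, 0(r1)",
]
for zone in split_into_zones(program):
    print(zone)
```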

3 A Framework for the Design of the Routing Architecture of a Dynamically
Reconfigurable Media Processor

3.1 Introduction 

Commercially available FPGAs have used symmetrical, row-based, sea-of-gates or
hierarchical routing architectures. We have recently proposed a tool set that will aid the
design of a dynamically reconfigurable processor through the use of a set of analysis and
design tools. In this paper we propose a heterogeneous hierarchical routing architecture.
Compared to hierarchical and symmetrical FPGA approaches, the building blocks are of
variable size. Due to variable traffic among clusters and non-symmetric characteristics,
different types of switches are needed at each hierarchy level. Switches vary even at the
same hierarchy level. This results in heterogeneity between groups of building blocks at
the same hierarchy level, as opposed to the classical H-FPGA approach. The aim of this paper
is to show a framework for the design of the heterogeneous hierarchical routing
architecture.

3.2 Routing Architectures 

First we look at the existing routing architectures and then explain the proposed
architecture. A routing architecture consists of wiring segments, switch boxes and the
building blocks. Commercially available FPGAs have used symmetrical, row-based,
sea-of-gates or hierarchical routing architectures (Figure 5). Xilinx and QuickLogic use
symmetrical, Actel and Cross-Point use row-based, and Plessey and Algotronix use
sea-of-gates architectures. Altera and AMD use hierarchical routing architectures (Table 2).

Figure 5 Routing Architecture Overview

In all of these FPGAs the interconnections and how they are programmed vary. Currently
there are four technologies in use: static RAM cells, anti-fuse, EPROM
transistors, and EEPROM transistors. Depending upon the application, one FPGA
technology may have features desirable for that application (Table 3).

Static RAM Technology - In the static RAM FPGA, programmable connections are
made using pass transistors, transmission gates, or multiplexers that are controlled by
SRAM cells. The advantage of this technology is that it allows fast in-circuit
reconfiguration. The major disadvantage is the size of the chip required by the RAM
technology.

Anti-Fuse Technology - An anti-fuse resides in a high-impedance state and can be
programmed into a low-impedance or "fused" state. This technology is less expensive
than the RAM technology, but the device can be programmed only once.

EPROM / EEPROM Technology - This method is the same as used in the EPROM
memories. One advantage of this technology is that it can be reprogrammed without
external storage of configuration, though the EPROM transistors cannot be
reprogrammed in-circuit. The following table shows some of the characteristics of the
above programming technologies.

[Table 3: Characteristics of the programming technologies. Surviving entries (two rows): 4k, 10-20 fF]



3.3 Proposed Architecture: Heterogeneous Hierarchical Routing 
Architecture 



3.3.1 Introduction 

In the profiling and partitioning step, recurring sequences of instructions are detected and
assigned to be run on hardware. This step also takes into account the control flow
structure of the code. Each bulk of purely data-dependent code in between the control
structures is assigned to be a zone. Then the partitioning tool runs a longest common
subsequence type of algorithm to find the recurring patterns between the zones to be
assigned to run on hardware. This tool also gives information about each zone, such as
how frequently it is used, input and output pins, size, dependency between zones, etc.
The placement tool treats the zones as blocks and, with the given information, places the
more related zones closer to each other. As a result the interconnection pattern is known
prior to execution. That helps to exploit locality and thereby reduce the interconnection
requirements.
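
The text only says that a "longest common subsequence type of algorithm" is used; the exact variant is not given. As a hedged illustration, the sketch below applies the textbook dynamic-programming LCS to the opcode sequences of two zones. The zone contents and the function name are invented for the example.

```python
from typing import List, Sequence


def longest_common_subsequence(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Classic O(len(a) * len(b)) dynamic-programming LCS."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table to recover one longest common subsequence.
    result: List[str] = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


zone_a = ["ld", "mul", "add", "st"]
zone_b = ["ld", "add", "mul", "add", "st"]
print(longest_common_subsequence(zone_a, zone_b))
```

The recovered pattern is sequential but not necessarily contiguous in either zone, which matches the way building blocks are described in this chapter.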

The symmetrical FPGA architecture (Figure 6) suffers from lower density, speed and
performance compared to conventional gate arrays. Hierarchical interconnection
structures for FPGAs have been proposed to overcome these problems. The connection
and switch block flexibilities may be different at each level of hierarchy. Lower levels
need less flexibility because they have fewer connections to route. Aggarwal [AG94]
says that H-FPGAs (Figure 7) can implement circuits with fewer routing switches in total
compared to symmetrical FPGAs. According to Li [LI99], for H-FPGAs the amount of
routing resources required is greatly reduced while maintaining good routability. It has
been proved that the total number of switches in an H-FPGA is less than in a
conventional FPGA under equivalent routability [LAI97]. Having fewer switches to route
a net in H-FPGAs reduces the total capacitance of the network. Therefore an H-FPGA can
implement much faster logic with far fewer routing resources than a standard
FPGA. H-FPGAs also offer the advantage of more predictable routing with lower delays.
Hence the density of H-FPGAs can be higher than that of conventional FPGAs. The proposed
partitioning tool [AK02] analyses the application code at assembly level and extracts the
control-data flow graph. The output of the tool is a library of building blocks and their
interaction. (A building block comprises sequential but not necessarily contiguous
assembly level instructions.) The building blocks represent those groups or patterns of
instructions that occur frequently and hence qualify for hardware implementation. Initial
building blocks are formed from regions that do not contain branch instructions. These
regions are referred to as zones. To maximize the use of reconfigurability amongst
zones, possible parallelism and speculative execution possibilities are exploited. The
placement tool places the building blocks that exchange data more frequently close
together. Compared to the symmetrical and hierarchical FPGA approaches, the building
blocks (zones) are of variable size. Classical horizontal and vertical channels won't work
in our case. Even if we line up the zones, space will be wasted. A consistent wire bandwidth
won't work either, because we have variable traffic within the modules as well. Due to variable
traffic among clusters and non-symmetric characteristics, different types of switches are
needed at each hierarchy level. Switches vary even at the same hierarchy level. This
results in heterogeneity between groups of building blocks at the same hierarchy level, as
opposed to the classical H-FPGA approach. The aim of this chapter is to incorporate the
studies mentioned above with our approach and to show a framework for the design of
the heterogeneous hierarchical routing architecture.




Figure 6: Symmetrical FPGA    Figure 7: Hierarchical FPGA

3.3.2 Proposed Architecture

The clustering algorithm gives building blocks B = {B1, B2, ..., Bk}, where Bi ∈ B.
Based on the data dependency between the building blocks, subsets of B, which are not
necessarily disjoint, are formed. They are not disjoint because the dependency analysis of
the building blocks might end with the decision to insert a building block into two
subsets. If the data dependencies between the building block and the candidate clusters are
of the same level, the clustering algorithm can't decide where to insert the building block,
so it provides a copy of the building block to each candidate cluster. As a result
there may be multiple copies of a building block all over the hierarchy.
The clustering algorithm uses a threshold-based approach to form the subsets, so even though
logic blocks that interact heavily with each other are grouped, there will still be data
dependency with the other clusters. When clusters are formed, the data dependencies between
the clusters are also kept. With this information, instead of doing an initial random
placement of building blocks, the placement tool can make intelligent decisions in placing
the building blocks. The building blocks within a cluster are placed close to each other and
form a node in the hierarchy. Having the information of data dependency between the
clusters helps in placing the clusters efficiently to form the higher levels of the hierarchy.
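
The tie-handling rule described above (a block whose dependency on several candidate clusters is at the same level is copied into each of them) can be sketched as follows. This is a simplified reading, not the actual clustering algorithm: the function name, the dictionary of dependency weights and the single numeric `threshold` are assumptions for illustration.

```python
from typing import Dict, List


def clusters_for_block(dependency: Dict[str, float], threshold: float) -> List[str]:
    """Clusters that should receive a copy of a building block.

    `dependency` maps a candidate cluster name to the strength of the data
    dependency between the block and that cluster. The block joins every
    cluster that ties for the strongest dependency, provided that strength
    reaches `threshold`.
    """
    if not dependency:
        return []
    strongest = max(dependency.values())
    if strongest < threshold:
        return []
    return sorted(c for c, w in dependency.items() if w == strongest)


# The block depends equally on C1 and C2, so both get a copy.
print(clusters_for_block({"C1": 5, "C2": 5, "C3": 2}, threshold=3))
```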



3.3.2.1 Overall Architecture

Clusters of building blocks form level-1 M modules. Similarly, clusters of M modules
form level-2 C modules. We define two types of switches: local switches (LS) and gateway
switches (GS). Building blocks are named B = {B1, B2, ..., Bk}, where Bi ∈ B. Clusters of
building blocks inside level-2 are M modules, M = {M1, M2, ...}, where Mj ∈ M. Clusters
of M modules inside level-3 are C modules, C = {C1, C2, ..., Cm}, where Ck ∈ C. A building
block Bi in module Mj of cluster Ck can establish a connection to another building block
in some other cluster Cp through the local and gateway switches. On the other hand,
building block Bi in module Mj of cluster Ck can establish a connection to another block
within the same cluster Ck through local switches. Assuming that the clustering algorithm
gave the building blocks B1, B2, ..., Bs as shown in Figure-8, Figure-9 shows the overall
routing architecture. A local connection is any connection within the same C module, and
it uses only local switches (LS). If a block in module 2 of cluster 1 sends data to a
block in M1 of cluster 2, the data goes through the following nodes: Source Block, LS in
M2, LS in C1, GS in C1, GS in Level-3, GS in C2, LS in C2, LS in M1, Destination
Block (Figure-10).
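
The hop sequence in the example above can be generated mechanically from the hierarchical position of the two blocks. The Python sketch below is an illustration under stated assumptions: the `Address` tuple and `route` function are hypothetical names, the first gateway hop is taken to belong to the source cluster, and the hop lists for connections inside one cluster are a plausible reading of "uses only local switches" rather than something the text spells out.

```python
from typing import List, NamedTuple


class Address(NamedTuple):
    cluster: int   # index of the C module
    module: int    # index of the M module inside the cluster
    block: int     # index of the building block inside the module


def route(src: Address, dst: Address) -> List[str]:
    """List the switches a transfer passes through, from source to destination."""
    hops = ["Source Block"]
    if src.cluster == dst.cluster:
        # Local connection: only local switches (LS) are used.
        hops.append(f"LS in M{src.module}")
        if src.module != dst.module:
            hops.append(f"LS in C{src.cluster}")
            hops.append(f"LS in M{dst.module}")
    else:
        # Inter-cluster connection: leave through the gateway switches (GS).
        hops += [
            f"LS in M{src.module}",
            f"LS in C{src.cluster}",
            f"GS in C{src.cluster}",
            "GS in Level-3",
            f"GS in C{dst.cluster}",
            f"LS in C{dst.cluster}",
            f"LS in M{dst.module}",
        ]
    hops.append("Destination Block")
    return hops


# Block in module 2 of cluster 1 sends to a block in module 1 of cluster 2.
print(route(Address(1, 2, 0), Address(2, 1, 0)))
```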




Figure-8 Building blocks

Figure-9 Heterogeneous hierarchical routing architecture

Figure-10 Switching



3.3.3 Switching Architecture 

Placement knows the relationship between the zones and places the related zones closer
to each other. This is its advantage over other placement algorithms. This will help the
router in deciding the switching architecture. Since we already know
that the closer zones are related and exchange data more frequently, the switches will
be more flexible inside each cluster.

Dynamic switches are employed to handle the variable traffic load in different
configurations. A switch may stay in idle mode and process some short sequences of
instructions that were not assigned to any zone. These instructions can be part of an else
statement that is very unlikely to execute, so the clustering tool decides not to run them on
the hardware. Instead of assigning the execution of these instructions to the general-purpose
processor, it is handled within the reconfigurable unit. This saves communication
time and switching time. When the traffic load between two modules increases after a
configuration, an idle switch that is in between these modules can be activated to handle
the traffic. Each switch might have copies of pins to resolve bending
problems in the switches. The next section explains the procedure for placing the
building blocks that belong to the same cluster.



3.3.4 Building Block Placement

Let n be the number of building blocks in a given cluster. Let Cij be the number of
connections between building blocks Bi and Bj, and let the amount of data traffic be
measured in terms of the average number of bits transferred between the blocks Bi and Bj,
where 1 ≤ i ≤ n, 1 ≤ j ≤ n. Then the cost of data exchange between two library modules
Bi and Bj is defined as:

[equation missing in the source]

Building blocks are placed based on the time to execute the circuit (T) constraint. Time is
measured by the profiling tool based on the trace files, specifying that the circuit should
run in less than T units of time. The algorithm has three phases. Phases one and two
ignore the fact that the building blocks are variable in size. The first phase also ignores the
timing constraint. In the first phase, placement is done on a grid to specify whether a block
should be placed to the north, south, east or west of another block, using the dependency
information. If the output of the first phase meets the timing constraint, then the second phase
is omitted and the suggested placement orientation is fed to the placement algorithm,
which will allocate the actual space for the building blocks. If the timing constraint is not
met, then the second phase increases the wiring capacity or changes the place of building
blocks until the timing requirement is met.

3.3.4.1 First Phase : 

The objective is to place the pairs of building blocks that have the most costly data exchange
closest to each other. As the cost of the link decreases, the algorithm tolerates a
Manhattan distance of more than 1 hop between the pairs of building blocks. This phase
guarantees the best area allocation because the building blocks are placed based on their
dependency, leading to the use of fewer switches to establish a connection
between them. An integer programming technique is used to decide the
orientation of the building blocks with respect to each other. Given that there are n
building blocks, let Pi(x,y) denote the (x,y) coordinates of building block Bi, such that no other
building block has the same (x,y) coordinates. The objective function is:

[objective function missing in the source]
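
Since the objective function itself is not available here, the sketch below uses an assumed one: the sum over all pairs of the data-exchange cost multiplied by the Manhattan distance between the two blocks. It also replaces the integer programming formulation with a brute-force search over grid cells, which is only practical for very small examples. All names and the sample cost matrix are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]


def manhattan(p: Cell, q: Cell) -> int:
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def placement_cost(coords: Sequence[Cell], cost: List[List[int]]) -> int:
    """Assumed objective: sum of cost[i][j] * Manhattan distance over all pairs."""
    n = len(coords)
    return sum(cost[i][j] * manhattan(coords[i], coords[j])
               for i in range(n) for j in range(i + 1, n))


def first_phase(cost: List[List[int]], rows: int, cols: int) -> List[Cell]:
    """Exhaustively try every assignment of distinct grid cells to blocks."""
    cells = [(x, y) for x in range(cols) for y in range(rows)]
    best = min(permutations(cells, len(cost)),
               key=lambda coords: placement_cost(coords, cost))
    return list(best)


# Symmetric cost matrix for blocks A, B, C, D: A-B and C-D exchange the most data.
cost = [
    [0, 9, 1, 0],
    [9, 0, 0, 1],
    [1, 0, 0, 9],
    [0, 1, 9, 0],
]
placement = first_phase(cost, rows=2, cols=2)
print(placement, placement_cost(placement, cost))
```

With this cost matrix every communicating pair can be placed one hop apart on a 2x2 grid, so the search ends with each heavy pair adjacent.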

3.3.4.2 Second Phase:

The placement graph is the output of the first phase. The circuit's timing is measured in the
second phase. If it fails, then the area is compromised by either moving one or more
building blocks or increasing the wiring capacity on the path from node i to node j.
Obviously this will result in extra usage of area by the additional wires or switches. The
second phase runs till the area and time values are below threshold. The first phase gives the
placement orientation information; the second phase gives the switching and wiring
requirements. All this information is then fed to the placement tool, which uses a
modified version of the simulated annealing method.

Similarly, as the profiler tool gives the information about the dependency between the
clusters as well, the placement at the cluster and higher levels follows the procedure
explained above. Figure-11a shows the cost matrix of a given 6 blocks (A, B, C, D, E, F).
Those 6 nodes are treated as points to be placed on a 6x6 matrix (Figure-11b). The output of
the first phase is shown in Figure-11c. If the timing constraint is not satisfied, then the
second phase increases the wiring capacity between the building blocks that have heavy
traffic, as shown in Figure-11d.



Example:

Given the cost matrix and the empty grid matrix

Figure 11 (a: Cost matrix, b: 6x6 grid, c: Output of First Phase, d: Output of Second Phase)



3.4 Routing Algorithm

The profiling tool gives the timing analysis of each building block. The time at which a logic
block needs to be routed is speculatively known a priori. The profiling tool also gives the
dependency information between the building blocks. This information helps to
reduce the complexity of finding the route between two building blocks. In addition to
that, we propose to allocate an address table on chip, which stores the location of a C level
module on chip. If a building block needs to send data to another block, it checks the
address table to find out in which hierarchy level and in which C level module the
destination node is located. This will help to find the place of the cluster without an
exhaustive search. The shortest path or best available path search algorithm is then reduced
to the C module level. During reconfigurations the address table will bring the cost of update
and deletion.
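
A minimal sketch of such an address table is given below, assuming it can be modelled as a map from a building block name to its hierarchy level and C level module. The class, its methods and the block names are illustrative only; the text does not specify the table layout.

```python
from typing import Dict, Optional, Tuple

Location = Tuple[int, str]   # (hierarchy level, C level module name)


class AddressTable:
    """Maps a building block to the hierarchy level and C module holding it."""

    def __init__(self) -> None:
        self._entries: Dict[str, Location] = {}

    def update(self, block: str, level: int, c_module: str) -> None:
        # Called when a reconfiguration places or moves a block.
        self._entries[block] = (level, c_module)

    def delete(self, block: str) -> None:
        # Called when a reconfiguration removes a block.
        self._entries.pop(block, None)

    def lookup(self, block: str) -> Optional[Location]:
        return self._entries.get(block)


def needs_gateway(table: AddressTable, src: str, dst: str) -> bool:
    """True when the two blocks sit in different C modules."""
    a, b = table.lookup(src), table.lookup(dst)
    if a is None or b is None:
        raise KeyError("block not present in the address table")
    return a[1] != b[1]


table = AddressTable()
table.update("B1", 3, "C1")
table.update("B2", 3, "C1")
table.update("B7", 3, "C2")
print(needs_gateway(table, "B1", "B2"), needs_gateway(table, "B1", "B7"))
```

The `update` and `delete` calls are where the reconfiguration cost mentioned above would be paid.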



3.4.1 Buffering

Buffering is needed when the receiver needs to process bulks of data at a time (8 rows at
a time from an image). In that case it waits till all 8 rows are filled and then starts
processing the data. For a given context, if we know that the consumer demands data in a block
manner, the receiver should rearrange the incoming data format. Sender and receiver
should be context aware. Buffers are only kept at the receiver side. The producer doesn't
store data; it simply dumps the data to the bus as soon as it is available. The receiver should
be aware of the context of each request and make a decision based on the priority in order
to prevent collision. If the receiver needs to get data from more than one sender, then
those senders which are in the ok list are allowed to transmit data; other requests should
be denied. This is again handled by the collision prevention mechanism. The connection
service mechanism brings a control overhead cost but results in controlled router
service, optimal use of resources and parallelism.
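
The receiver-side behaviour described above can be sketched in a few lines of Python. This is an illustrative model, not the proposed hardware: the class name, the `receive` method and the way the ok list is represented are assumptions, and priority handling is left out.

```python
from typing import Iterable, List, Optional


class ReceiverBuffer:
    """Collects rows from permitted senders and releases them in blocks."""

    def __init__(self, ok_list: Iterable[str], rows_per_block: int = 8) -> None:
        self.ok_list = set(ok_list)
        self.rows_per_block = rows_per_block
        self.denied = 0                      # requests rejected so far
        self._rows: List[List[int]] = []

    def receive(self, sender: str, row: List[int]) -> Optional[List[List[int]]]:
        """Store one row; return a full block once enough rows have arrived."""
        if sender not in self.ok_list:
            self.denied += 1                 # sender is not in the ok list
            return None
        self._rows.append(row)
        if len(self._rows) < self.rows_per_block:
            return None                      # keep waiting for more rows
        block, self._rows = self._rows, []
        return block


buf = ReceiverBuffer(ok_list={"P1"})
outputs = [buf.receive("P1", [r]) for r in range(7)]
rejected = buf.receive("P2", [99])           # P2 is not allowed to transmit
block = buf.receive("P1", [7])               # eighth row completes the block
print(len(block), buf.denied)
```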



3.4.2 Possible Connection Services 

3.4.2.1 Connection Oriented: 

The first phase is to establish the connection, then transmit data, and finally release the
connection. If the producer sends data faster than the receiver can process it, then a flow
control mechanism is needed. The sender needs to be aware of whether or not the receiver
is able to keep up.

3.4.2.2 Two Way connection: 

The flow of data is in both directions between sender and receiver only, but the receiver may
not always be ready. It can be processing some other task. One solution is automatic
buffering and queuing done within the receiver's hardware. A second solution is that the
consumer may send a signal to the producer to stop sending, but in that case a buffer is
needed at the producer side as well. An alternative solution is to speed up the receiver
hardware by increasing the local clock frequency.

3.4.2.3 Buffering

Buffering is needed when the receiver needs to process bulks of data at a time (8 rows at a
time from an image). It waits till all 8 rows are filled and then processes the data. For a
given context, if we know that the consumer demands data in a block manner, the receiver
should rearrange the incoming data format. Sender and receiver should be context aware.
Buffers are only kept at the receiver side. The producer doesn't store data; it simply dumps
the data to the bus as soon as it is available. The receiver should be aware of the context of
each request and make a decision based on the priority in order to prevent collision. If
the receiver needs to get data from more than one sender, then those senders which are
in the ok list are allowed to transmit data; other requests should be denied. This is again
handled by the collision prevention mechanism. The connection service mechanism
brings a control overhead cost but results in controlled router service, optimal use of
resources and parallelism.



3.5 Conclusion 

We propose that the routing tool will provide optimum interconnection pathways
between different hierarchy levels with variable switching. Providing a wide range of
switches that have different channel widths, variable numbers of connections, priority
switching and task multiplexing mechanisms (computation or switching) will enable
cost-efficient resource allocation.

3.6 A Framework for the Routability of a Dynamically
Reconfigurable Processor

3.6.1 Introduction 

DeHon [MIT96] showed the switching requirements using Rent's model on hierarchical
networks. Stroobandt [STO96] and Donath [DON79] showed the extensions of Rent's rule
to the determination of the Rent exponent and average interconnection length; finally
Brown [BRO93] showed a stochastic model for routability in symmetric FPGAs.
According to Donath, the hierarchical approach gives a better estimate of interconnection
lengths. Our aim in this paper is to incorporate these studies with our approach and to
show a framework for routability with variable sized logic blocks, connection and
switch block flexibilities, as well as variable traffic, for the dynamically reconfigurable
processor. There is a need to estimate the power consumption, area and routability from a
design description of the circuit. Interconnect capacitance has not scaled down as much
as gate capacitance with the advances in sub-micron technology. Capacitance is an
important component of the power consumption through the interconnect loading. We need
to incorporate the wire capacitance into the power consumption. In order to do that we need
to estimate the average interconnect capacitance, which requires an estimate of the
average interconnect length. This measure is also necessary for the routability equation.
Predicting the interconnection lengths and the total number of connections before
placement is very important, since this procedure reduces the requirements, helps to
limit the solution space for the placement algorithm and helps to choose the most suitable
architecture for the design. Management of interconnection length is critical in terms of
the amount of space required and the time delay for signals, as the capacitance increases
with the wire length.

In order to find the average connection length, researchers have been using two ways. One
is to place the building blocks with a given routing architecture and find the exact wiring
that is necessary. However, we want to predict the wiring requirement even before
placement. One solution for that is the use of Rent's rule, which defines a relationship
between the building blocks forming a module and the number of external pins of that
module. Rent's rule is based on a partitioning scheme: a net list is partitioned, then each
partition is partitioned again, till gate level or building block level is reached.

Suppose a module satisfies P = X B^r, where P is the number of external pins on the
module, B is the number of blocks per module, X is the average number of pins per
block and r is Rent's exponent, which is a measure of the interconnection complexity
of the logic. P and r are fixed for a given logic net, therefore X and B are the adjustable
parameters. Rent's rule models the behavior of a scheme that keeps the number of
interconnections between sub-designs as low as possible, leading to many short
interconnections and fewer long ones. Rent's rule holds equally well for partitioning of
logic into a hierarchical model [MIT96]. Predicting the interconnection lengths and the total
number of connections before placement is very important, since this procedure reduces
the requirements, helps to limit the solution space for the placement algorithm and
helps to choose the most suitable architecture for the design. Management of
interconnection length is critical in terms of the amount of space required and the time
delay for signals, as the capacitance increases with the wire length. Once these are known,
routability predictions can be made. The exponent value ranges from 0.4 for very simple,
regular interconnections up to 0.8 for complex irregular logic chips. With Rent's rule we can
define the number of pins at each level of the hierarchical model. If α (total number of
connections / total number of pins) is set to be a constant (obtained from the circuit), then
the total number of connections at level i is given by Ci = α * Pi.

Rent's rule implies that a module at some hierarchy level is representative of the design: the
number of external pins or the Rent exponent of that module can be scaled to represent the
whole design. Rent's model has been applied to symmetric FPGA and hierarchical FPGA
approaches. These approaches use placement with fixed types of blocks. At each level the
number of building blocks forming a module is fixed. It is clear that in such a case the
property of one module can represent the whole design. However, in our case the profiler tool
gives out zones that form the basic building blocks. Each zone is variable in terms of
complexity, contents and size. The placement tool places the more related zones together to
form a hierarchy. Each module can have a different number and different types of zones.
According to Marck [MAR95], Rent's rule is not applicable if the interconnection complexities
are different for blocks or hierarchical levels. In such a case the average number of pins at
each hierarchy level may be different. He modified Rent's rule by deriving a local Rent
exponent capable of dealing with the variance in complexity, which results in a more accurate
interconnection complexity measure. We use Marck's method for deriving Rent's exponent and
the stochastic model of Brown in our work. In the following sections we describe the
methodology to represent the average interconnection length with Rent's rule and the
routability with the stochastic model.
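
As a small worked example of the two relations above, P = X B^r and Ci = α * Pi, the Python sketch below evaluates them for a few module sizes. The numeric values (X = 4, r = 0.5, α = 0.5) are arbitrary illustration values, not measurements from this work, and the function names are hypothetical.

```python
def rent_pins(avg_pins_per_block: float, blocks: int, rent_exponent: float) -> float:
    """Rent's rule: external pins P = X * B**r."""
    return avg_pins_per_block * blocks ** rent_exponent


def connections_at_level(alpha: float, pins: float) -> float:
    """Connections at one hierarchy level: C_i = alpha * P_i."""
    return alpha * pins


X, r, alpha = 4.0, 0.5, 0.5          # illustrative values only
for level, blocks in enumerate([4, 16, 64], start=1):
    pins = rent_pins(X, blocks, r)
    print(level, blocks, pins, connections_at_level(alpha, pins))
```

With these values, a module of 16 blocks needs 4 * 16^0.5 = 16 external pins and therefore 8 connections, whereas 16 unconnected blocks would expose 64 pins.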



3.6.2 Methodology 



Routability is the likelihood that a circuit can be successfully routed in a given routing
architecture. The number of routing switches and their distribution over the architecture
are parameters of the stochastic model. Additional parameters, such as the total number of
connections (Ct), the length of each connection (R), the number of wiring tracks per channel,
and the flexibilities of connection and switch blocks, are needed to represent the circuit to be
routed. However, parameters Ct and R remain unknown before the actual partitioning
and placement. Rent's rule defines a relationship for partitioning logic into
sub-modules.



3.6.3 Deriving Interconnection Length 
In Marck's model [MAR95], nets are the interconnections between blocks through which
information is transported from one block to another. A net can transport information to
multiple blocks. Modules are defined as sets of interconnected blocks. Each net
connecting two modules is connected to pins that are the interfaces of the modules.
Marck defines a maximum number of pins per module and Mmax as the maximum
number of modules allowed. In our case, series of instructions, not necessarily sequential,
are grouped together to form the library modules with the clustering algorithm. The
algorithm uses the dependency between instructions, frequency of occurrence of patterns
and control flow graph analysis information for grouping. The tool gives the library
modules (programmable building blocks), dependency between blocks, frequency of
occurrence of each block and number of input and output pins of each block. Cluster
sizes are optimized such that the modules can execute in parallel. Since the interconnection
behavior between modules is known beforehand, modules that are more likely to be
connected are grouped together at the same hierarchy level. This way the
interconnections between library modules are reduced, resulting in a small number of
edges crossing module boundaries. Since we have accurate information about the
modules and their interaction, we don't need to define a limit on the number of pins or
modules. In our model the complexity of the building blocks and of the modules at each
level of the hierarchy is variable. Each module may have a drastically different Rent exponent,
and assigning a single Rent exponent to the combined design will lead to wrong estimates
of the properties of the layout, such as the average length and total number of connections.
A modification to Rent's model is needed to overcome this limitation. We do this by
representing the interconnection complexity as varying over the different hierarchical
levels and finding an exponent value for each level by:

$$r(B_m) = \frac{[\ldots]}{\log(B_m)}$$

where r(B_m) is a function of B_m. The slope of the log-log diagram of r(B_m) for each m
gives the Rent exponent.
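
As a rough illustration of reading a per-level exponent off a log-log diagram, the sketch
below fits a least-squares slope to log(pins) against log(blocks) for each hierarchy level.
It assumes the standard form of Rent's rule, T = t * B^r, which may differ in detail from the
modified rule used here. The function names and the sample data are hypothetical and are
not output of the profiler tool.

```python
import math


def rent_exponent(blocks, pins):
    """Least-squares slope of log(pins) against log(blocks).

    Assumes the standard Rent's rule T = t * B**r, so the slope is r.
    """
    if len(blocks) != len(pins) or len(blocks) < 2:
        raise ValueError("need at least two (blocks, pins) samples")
    xs = [math.log(b) for b in blocks]
    ys = [math.log(p) for p in pins]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("block counts must not all be equal")
    return sxy / sxx


def local_rent_exponents(levels):
    """Map each hierarchy level m to its own fitted exponent."""
    return {m: rent_exponent(b, p) for m, (b, p) in levels.items()}


# Hypothetical samples: level -> (block counts, pin counts).
# Level 1 follows T = 4 * B**0.5, level 2 follows T = 3 * B**0.75.
sample_levels = {
    1: ([4, 16, 64], [8, 16, 32]),
    2: ([1, 16, 256], [3, 24, 192]),
}

exponents = local_rent_exponents(sample_levels)
```

Fitting each level separately, rather than pooling all samples, is what keeps levels with
very different complexity from being averaged into a single misleading exponent.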

According to Gamal [GAM81], the length of each connection is drawn from P_L, which is a
geometric distribution with mean length R. Brown uses this information in his model.
However, since Gamal used a fixed type of block in his work, we can't directly assume that
the length distribution in our case is geometric. The clustering tool analyzes both the static
code and the trace file to form clusters. The static analysis calculates how many times a
dependency occurs in the static code. MPEG4 is highly user interactive, so each mode of
operation will initiate a different set of building blocks to be executed. The trace files help
to analyze how many times the estimated dependencies occur. These two observations
provide a probability model for the expected number of occurrences of dependencies and
the length. We then use Brown's stochastic model to predict the routability. The procedure to
find the average interconnection length is as follows:

1. The total number of pins is found through the profiler tool.

2. For each hierarchy level, the ratio of the number of connections to the number of pins is known.

3. Find the average connection length for each level using Marck's model.

4. Find the average connection length for the overall design.
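
The last step of the procedure can be sketched as follows. The text does not say how the
per-level averages are combined, so this sketch assumes a connection-count-weighted
average, with the number of connections at each level taken as pins times the
connections-to-pin ratio. The per-level average lengths are treated as given inputs (in the
procedure they would come from Marck's model), and all names and numbers are hypothetical.

```python
def overall_average_length(levels):
    """Combine per-level average connection lengths into one average.

    levels: iterable of (pins, connections_per_pin, average_length) tuples,
    one per hierarchy level. Weighting by connection count is an assumption.
    """
    total_connections = 0.0
    weighted_length = 0.0
    for pins, connections_per_pin, average_length in levels:
        connections = pins * connections_per_pin  # steps 1 and 2
        total_connections += connections
        weighted_length += connections * average_length  # step 3 result
    if total_connections == 0:
        raise ValueError("design has no connections")
    return weighted_length / total_connections  # step 4


# Hypothetical two-level design: (pins, connections per pin, average length).
sample = [
    (100, 0.5, 2.0),  # 50 connections of average length 2
    (40, 0.5, 5.0),   # 20 connections of average length 5
]
average = overall_average_length(sample)  # (50*2 + 20*5) / 70
```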

3.6.4 Routability Model 

Routability is the ratio of the expected number of successfully routed connections to the total
number of connections.

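C_T: total number of two-point connections (C_1, C_2, ..., C_{C_T})
R_{C_1}, R_{C_2}, ..., R_{C_{C_T}}: events that the corresponding connection is successfully established
P(R_{C_i}): probability that connection C_i is successfully routed

$$\text{Routability} = \frac{1}{C_T} \sum_{i=1}^{C_T} P(R_{C_i})$$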
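
Since routability is the expected number of successfully routed connections divided by the
total number of connections, it reduces to the mean of the per-connection success
probabilities. A minimal sketch, with a hypothetical function name and made-up probabilities:

```python
def routability(success_probabilities):
    """Mean of the per-connection success probabilities P(R_Ci).

    The sum of the probabilities is the expected number of routed
    connections; dividing by the connection count gives the ratio.
    """
    if not success_probabilities:
        raise ValueError("at least one connection is required")
    for p in success_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return sum(success_probabilities) / len(success_probabilities)


# Hypothetical design with four two-point connections.
example = routability([1.0, 0.5, 0.75, 0.75])
```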

Gateway switches, like local switches, have variable flexibilities over the hierarchical
levels. As the hierarchy level increases, the complexity of the switches will increase.

3.6.5 Probability of Successfully Routing a Connection 

For W tracks in each routing channel at a specific point, we define p_i as the probability of
the track being occupied after some of the circuit has been routed. According to Brown, this
event can be approximated by a Poisson distribution. In Figure-5 we show the series
of events needed to establish a connection between two building blocks of different C modules.

X1: Event that the logic block pin associated with C_i can connect to at least one track at
the first local switch block. (There are Fl tracks that can connect a building block, but any
number of those tracks may already be in use by other connections.)

X2: Event that the pin associated with the first LS can connect to the local switch of the C
module that the building block belongs to.

X3: Event that the pin associated with the LS of the C module can connect to the GS of that
same C module.

S1, S2, ..., Sn: Events that C_i can successfully reach at least one track on the outgoing side
of the first, second, up to the nth gateway switch.

X4: Event that the GS of the C module where the destination block stays can connect to the
LS of the same C module.

X5: Event that the LS of the current C module can connect to the LS of the M module that
the destination block belongs to.

X6: Event that the LS of the M module that the destination block belongs to can connect to the
destination block.

L_Ci is the length of the connection between two blocks. A typical connection C_i is
successfully routed only if all the events
X1, X2, X3, S1, S2, ..., Sn, X4, X5, X6 occur.
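
Because a connection is routed only if every event in the chain occurs, its success
probability can be sketched as a product over the chain. The sketch below treats the events
as independent, which is a simplifying assumption not stated in the text; a fuller model
would condition each event on the ones before it. The function name and the sample
probabilities are hypothetical.

```python
import math


def connection_success_probability(x_probs, gateway_probs):
    """Probability that X1..X3, S1..Sn, X4..X6 all occur.

    x_probs: six probabilities for X1..X6, in that order.
    gateway_probs: probabilities for S1..Sn (may be empty).
    Independence of the events is an assumption of this sketch.
    """
    if len(x_probs) != 6:
        raise ValueError("expected probabilities for X1..X6")
    # Order the chain as in the text: X1, X2, X3, S1..Sn, X4, X5, X6.
    chain = list(x_probs[:3]) + list(gateway_probs) + list(x_probs[3:])
    for p in chain:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return math.prod(chain)


# Hypothetical connection crossing two gateway switches.
p_route = connection_success_probability([0.5] * 6, [0.5, 0.5])
```

Under this assumption each additional gateway switch on the path multiplies in another
factor of at most one, so longer connections can only become harder to route.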

3.7 Conclusion and Future Work

In this chapter we present a framework to model the routability of a given routing
architecture. We give an overview of the proposed tool set for the design of a
dynamically reconfigurable processor and then we define the routing architecture. Based
on that routing architecture, we show how to modify Rent's
model and Brown's stochastic model to predict the routability. We are currently working
on the profiler tool to form the building blocks and their interaction from MPEG4 code.
The next step is to list all possible events for a possible connection scenario and develop a
probabilistic model for each case.



4 Testing Strategy 

Circuit netlist and placement file formats will be modified to compare the routing
performance. A placement file is first converted into LUT format; then the connectivity
information between the lists of LUTs obtained is used to cluster the LUTs related to
each other using the clustering tool. The output is then fed into the routing tool. This way we
will be able to use existing benchmarks and verify the functionality of the proposed
model.